R packages

This practical uses the following R packages (with the versions that were used to generate the solutions):

  • survival (version: 3.3.1)

R version 4.2.1 (2022-06-23 ucrt)


For this practical, we will use the heart and retinopathy data sets from the survival package. More details about the data sets can be found in:

Combinations of Functions and Loops

Common R Objects

Task 1

Examine the following code in R.

h <- function(k){
  if (k <= 20){
    3 * k
  } else {
    2 * k
  }
}

Investigate what this code takes as an input (e.g. scalar, vector, matrix, array, data.frame, list) and what the output will be in every case.

Solution 1

h <- function(k){
  if (k <= 20){
    3 * k
  } else {
    2 * k
  }
}

# scalar:
scalar_test <- 10

# vector:
vec_test <- c(1:10)

# matrix:
mat_test <- matrix(data = 1:6, nrow = 3, ncol = 2)

# array:
array_test <- array(data = 1:5, dim = c(2, 2, 2))

# data frame:
df_test <- data.frame(v1 = 1:10, v2 = sample(0:1, 10, replace = TRUE))

# list:
list_test <- list(l1 = 1:30, l2 = c("m", "f"), l3 = c(TRUE, TRUE, FALSE, FALSE, FALSE))

# Test the function (after you remove the #):
# h(scalar_test)
# h(vec_test)
# h(mat_test)
# h(array_test)
# h(df_test)
# h(list_test)
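
The difference between the scalar and the vector case can be checked directly. The sketch below, assuming the function h from this task, captures whatever condition R signals when if() receives a condition of length greater than one (an error from R 4.2.0 onwards, a warning in earlier versions). It then contrasts this with a vectorised variant; h_vec is a hypothetical helper name and not part of the practical.

```r
h <- function(k) {
  if (k <= 20) {
    3 * k
  } else {
    2 * k
  }
}

# Scalar input: the condition has length one, so if() behaves as expected.
h(10)

# Vector input: the condition has length ten. Capture the signalled
# condition instead of letting it stop the script.
outcome <- tryCatch(
  h(1:10),
  warning = function(w) paste("warning:", conditionMessage(w)),
  error   = function(e) paste("error:", conditionMessage(e))
)
print(outcome)

# Hypothetical vectorised variant: ifelse() evaluates the test element-wise.
h_vec <- function(k) ifelse(k <= 20, 3 * k, 2 * k)
h_vec(c(10, 20, 30))
```

ifelse() is only one possible way to vectorise the rule; the loop in the next task is another.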

Task 2

Examine the following code in R.

h <- function(k){
  for(i in 1:length(k)){
    k[i] <- if (k[i] <= 20){
      3 * k[i]
    } else {
      2 * k[i]
    }
  }
  k
}

Investigate what this code takes as an input (e.g. scalar, vector, matrix, array, data.frame, list) and what the output will be in every case.

Solution 2

h <- function(k){
  for(i in 1:length(k)){
    k[i] <- if (k[i] <= 20){
      3 * k[i]
    } else {
      2 * k[i]
    }
  }
  # return the modified object
  k
}

# scalar:
scalar_test <- 10

# vector:
vec_test <- c(1:10)

# matrix:
mat_test <- matrix(data = 1:6, nrow = 3, ncol = 2)

# array:
array_test <- array(data = 1:5, dim = c(2, 2, 2))

# data frame:
df_test <- data.frame(v1 = 1:10, v2 = sample(0:1, 10, replace = TRUE))

# list:
list_test <- list(l1 = 1:30, l2 = c("m", "f"), l3 = c(TRUE, TRUE, FALSE, FALSE, FALSE))

# h(scalar_test)
# h(vec_test)
# h(mat_test)
# h(array_test)
# h(df_test)
# h(list_test)
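
One detail of the loop is worth noting: 1:length(k) evaluates to c(1, 0) when k is empty, so the body would run on elements that do not exist. A minimal sketch, assuming the loop's result is returned at the end of the function, uses seq_along() instead; h_loop is a hypothetical name chosen for illustration.

```r
h_loop <- function(k) {
  # seq_along(k) is empty for empty input, so the loop body is skipped
  for (i in seq_along(k)) {
    k[i] <- if (k[i] <= 20) 3 * k[i] else 2 * k[i]
  }
  k
}

# Vector: each element is handled separately.
vec_res <- h_loop(c(5, 20, 25))
print(vec_res)

# Matrix: linear indexing visits every cell and the dimensions are kept.
mat_res <- h_loop(matrix(c(5, 10, 20, 25), nrow = 2))
print(mat_res)

# Empty input: returned unchanged.
empty_res <- h_loop(numeric(0))
print(empty_res)
```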

Task 3

Define a function (with the name mysummary) that takes one parameter (let’s call it x) and investigates if it is a data frame, a factor, or a numeric variable. It returns the message “This is a data.frame” if x is a data frame, “This is a factor” if x is a factor and “This is numeric” if x is numeric.

Use an if statement. When you want to display results from the function you can use the print(…) function.

Solution 3

mysummary <- function(x){
  if(is.data.frame(x)){
    print('This is a data.frame')
  }else if(is.factor(x)){
    print('This is a factor')
  }else if(is.numeric(x)){
    print('This is numeric')
  }
}

df_test <- data.frame(v1 = 1:10, v2 = sample(0:1, 10, replace = TRUE))
vec1_test <- as.factor(c("y", "n", "n", "n"))
vec2_test <- 1:10

mysummary(df_test)
## [1] "This is a data.frame"
mysummary(vec1_test)
## [1] "This is a factor"
mysummary(vec2_test)
## [1] "This is numeric"
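
The three type checks used in the branches can also be tried on their own. The short sketch below uses made-up objects (f, int_vec, chr_vec, small_df) to show which check responds to which kind of object.

```r
f <- factor(c("y", "n", "n", "n"))
int_vec <- 1:10
chr_vec <- c("1", "2", "3")
small_df <- data.frame(v1 = 1:3)

is.factor(f)            # TRUE
is.numeric(f)           # FALSE: factors are not treated as numeric
is.numeric(int_vec)     # TRUE: integer vectors count as numeric
is.numeric(chr_vec)     # FALSE: numbers stored as text are character
is.data.frame(small_df) # TRUE
is.numeric(small_df)    # FALSE
```

Objects for which none of the three checks is TRUE, such as character vectors, pass through the if/else chain without printing anything.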

Data Exploration

Task 1

Extend the previous function mysummary to take again one parameter (let’s call it x) but to return also some descriptive statistics. We therefore modify the function in such a way that a frequency table is given for factors and the mean and standard deviation are given for numeric data. For a data frame, return only an extra message that indicates that this is not implemented yet.

Use the functions table(…), mean(…) and sd(…).

Solution 1

mysummary <- function(x){
  if(is.data.frame(x)){
    print('This is a data.frame')
    print('Not yet implemented')
  }else if(is.factor(x)){
    print('This is a factor')
    print(table(x))
  }else if(is.numeric(x)){
    print('This is numeric')
    print('mean')
    print(mean(x))
    print('sd')
    print(sd(x))
  }
}

df_test <- data.frame(v1 = 1:10, v2 = sample(0:1, 10, replace = TRUE))
vec1_test <- as.factor(c("y", "n", "n", "n"))
vec2_test <- 1:10

mysummary(df_test)
## [1] "This is a data.frame"
## [1] "Not yet implemented"
mysummary(vec1_test)
## [1] "This is a factor"
## x
## n y 
## 3 1
mysummary(vec2_test)
## [1] "This is numeric"
## [1] "mean"
## [1] 5.5
## [1] "sd"
## [1] 3.02765

Task 2*

The previous function mysummary still does not do anything useful when we apply it on the whole data.frame. Extend the function to take again one parameter (let’s call it x) but to return also some descriptive statistics for the data frame.

This can be solved in the following way: whenever the function is called using a data.frame as argument, we use a for loop to loop over its columns. Within this loop we let the function call itself but now using the column as argument.

Solution 2*

mysummary <- function(x){
  if(is.data.frame(x)){
    print('This is a data.frame')
    for(i in 1:ncol(x)){
      # print the column name, then let the function call itself on the column
      print(names(x)[i])
      mysummary(x[, i])
    }
  }else if(is.factor(x)){
    print('This is a factor')
    print(table(x))
  }else if(is.numeric(x)){
    print('This is numeric')
    print('mean')
    print(mean(x))
    print('sd')
    print(sd(x))
  }
}

df1_test <- data.frame(v1 = 1:10, v2 = sample(0:1, 10, replace = TRUE))
df2_test <- data.frame(v1 = 1:10, v2 = sample(c("no", "yes"), 10, replace = TRUE))

mysummary(df1_test)
## [1] "This is a data.frame"
## [1] "v1"
## [1] "This is numeric"
## [1] "mean"
## [1] 5.5
## [1] "sd"
## [1] 3.02765
## [1] "v2"
## [1] "This is numeric"
## [1] "mean"
## [1] 0.2
## [1] "sd"
## [1] 0.421637
# v2 in df2_test is a character column, so only its name is printed
mysummary(df2_test)
## [1] "This is a data.frame"
## [1] "v1"
## [1] "This is numeric"
## [1] "mean"
## [1] 5.5
## [1] "sd"
## [1] 3.02765
## [1] "v2"
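
In the second data frame the output stops after the name of v2 because that column is of class character, which matches neither the factor nor the numeric branch. One possible way to make this visible, sketched here under the assumption that a short message is enough, is a final else branch. The names describe and toy are hypothetical and not part of the practical.

```r
describe <- function(x) {
  if (is.data.frame(x)) {
    for (i in seq_len(ncol(x))) {
      print(names(x)[i])
      describe(x[[i]])   # x[[i]] extracts column i as a vector
    }
  } else if (is.factor(x)) {
    print(table(x))
  } else if (is.numeric(x)) {
    print(c(mean = mean(x), sd = sd(x)))
  } else {
    # fallback for every other class, e.g. character columns
    print(paste("No summary for class", class(x)[1]))
  }
  invisible(NULL)
}

toy <- data.frame(v1 = 1:4, v2 = c("no", "yes", "no", "no"),
                  stringsAsFactors = FALSE)
out <- capture.output(describe(toy))
print(out)
```

Alternatively, the character column could be converted with as.factor() before the function is called, in which case the frequency table is printed.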

Data Manipulation


Task 1

Create a function with the name std_num that takes as input a data frame (let’s call it DF), standardizes all the numerical variables and returns the new data frame. Apply this function to the heart data set.

Use an if statement and a for loop. Note that you will have to transform some variables to factors first.


Solution 1

library(survival)

heart$event <- as.factor(heart$event)
heart$surgery <- as.factor(heart$surgery)
heart$transplant <- as.factor(heart$transplant)
heart$id <- as.factor(heart$id)

std_num <- function(DF){
  for (j in 1:dim(DF)[2]){
    if (is.numeric(DF[,j])){
      # centre on the mean and scale by the standard deviation
      DF[,j] <- (DF[,j] - mean(DF[,j]))/sd(DF[,j])
    }
  }
  DF
}

head(std_num(heart))
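
A quick way to check a standardization function is to confirm that every numeric column ends up with mean 0 and standard deviation 1 while the factors stay untouched. The sketch below does this on a small made-up data frame (toy is hypothetical) and assumes that std_num returns the modified data frame at the end.

```r
std_num <- function(DF) {
  for (j in seq_len(ncol(DF))) {
    if (is.numeric(DF[, j])) {
      DF[, j] <- (DF[, j] - mean(DF[, j])) / sd(DF[, j])
    }
  }
  DF
}

toy <- data.frame(age = c(30, 40, 50, 60),
                  group = factor(c("a", "b", "a", "b")))
toy_std <- std_num(toy)

mean(toy_std$age)   # numerically zero
sd(toy_std$age)     # numerically one
class(toy_std$group)

# The base R function scale() applies the same centring and scaling.
all.equal(toy_std$age, as.numeric(scale(toy$age)))
```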

Data Visualization

Task 1

Create a for loop that goes through the columns of the retinopathy data set and plots each column. If the column is a numerical variable then use a density plot, if the column is a categorical variable then use a barchart. Create a plot with multiple panels.

Use an if statement. Use the function par(mfrow = …) to create multiple plots. Note that you will have to transform some variables to factors first.

Solution 1

retinopathy$trt <- as.factor(retinopathy$trt)
retinopathy$status <- as.factor(retinopathy$status)

par(mfrow = c(3, 3))
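for (j in 1:dim(retinopathy)[2]){
  if (is.numeric(retinopathy[,j])){
    # density plot with a shaded area for numerical variables
    plot(density(retinopathy[,j]), col = rgb(0,0,1,0.5))
    polygon(density(retinopathy[,j]), col = rgb(0,0,1,0.5), border = "blue")
  } else if (is.factor(retinopathy[,j])){
    # plot() draws a barchart for factors
    plot(retinopathy[,j])
  }
}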

Task 2

Create a function that applies the previous code to any data set. The name of the function will be plot_summary and it will take as input a data frame (let’s call it dt) and the dimensions of the panel that includes the plots. In particular, dim_row represents the rows and dim_col the columns. The function will return a plot for each column. If the column is a numerical variable then use a density plot, if the column is a categorical variable then use a barchart. Apply this function to the retinopathy data set.

Use an if statement and a for loop. Use the function par(mfrow = …) to create multiple plots. Note that you will have to transform some variables to factors first.

Solution 2

retinopathy$trt <- as.factor(retinopathy$trt)
retinopathy$status <- as.factor(retinopathy$status)

plot_summary <- function(dt, dim_row, dim_col){
  par(mfrow = c(dim_row, dim_col))
  for (j in 1:dim(dt)[2]){
    if (is.numeric(dt[,j])){
      plot(density(dt[,j]), col = rgb(0,0,1,0.5))
      polygon(density(dt[,j]), col = rgb(0,0,1,0.5), border = "blue")
    } else if (is.factor(dt[,j])){
      plot(dt[,j])
    }
  }
}

plot_summary(retinopathy, 3, 3)

Task 3*

Extend the previous function plot_summary to include also plot titles indicating the variable. Apply this function to the retinopathy data set.

Use the paste0(…) function.

Solution 3*

retinopathy$trt <- as.factor(retinopathy$trt)
retinopathy$status <- as.factor(retinopathy$status)

plot_summary <- function(dt, dim_row, dim_col){
  par(mfrow = c(dim_row, dim_col))
  for (j in 1:dim(dt)[2]){
    if (is.numeric(dt[,j])){
      plot(density(dt[,j]), col = rgb(0,0,1,0.5), main = paste0(colnames(dt)[j]))
      polygon(density(dt[,j]), col = rgb(0,0,1,0.5), border = "blue")
    } else if (is.factor(dt[,j])){
      plot(dt[,j], main = paste0(colnames(dt)[j]))
    }
  }
}

plot_summary(retinopathy, 3, 3)
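
A function that calls par(mfrow = …) leaves the panel layout changed for every plot drawn afterwards. A possible refinement, sketched below on made-up data, stores the old setting and restores it with on.exit(). The names plot_summary_safe and toy are hypothetical, and pdf(NULL) is used only so that the example runs without opening a window or writing a file.

```r
plot_summary_safe <- function(dt, dim_row, dim_col) {
  # par() returns the previous values of the parameters it changes
  old_par <- par(mfrow = c(dim_row, dim_col))
  on.exit(par(old_par), add = TRUE)   # restore the layout on exit
  for (j in seq_len(ncol(dt))) {
    if (is.numeric(dt[, j])) {
      plot(density(dt[, j]), main = colnames(dt)[j])
    } else if (is.factor(dt[, j])) {
      plot(dt[, j], main = colnames(dt)[j])
    }
  }
  invisible(NULL)
}

pdf(NULL)   # null graphics device: nothing is written to disk
toy <- data.frame(x = c(1.2, 2.5, 3.1, 4.8, 5.0),
                  g = factor(c("a", "b", "a", "b", "a")))
plot_summary_safe(toy, 1, 2)
layout_after <- par("mfrow")   # back to a single panel
dev.off()
```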


Task 1

For each data set (heart and retinopathy), select only the observations that report an event.

Note that the variable name of the events is different for the two data sets.

Solution 1

heart[heart$event == 1, ]
retinopathy[retinopathy$status == 1, ]
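
Logical subsetting of this kind behaves differently when the event indicator contains missing values: a comparison with NA yields NA, and an NA index produces a row of NAs. The sketch below illustrates this on a made-up data frame (toy is hypothetical; whether heart or retinopathy contain missing events is not assumed here).

```r
toy <- data.frame(id = 1:4, event = c(1, 0, NA, 1))

# The NA in event produces an extra row filled with NA.
with_na <- toy[toy$event == 1, ]
print(with_na)

# which() keeps only the positions where the comparison is TRUE.
without_na <- toy[which(toy$event == 1), ]
print(without_na)

# subset() also drops rows where the condition is NA.
via_subset <- subset(toy, event == 1)
print(via_subset)
```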

Task 2

Create a function with the name dt_events that takes as input a list with data sets (let’s call it dt_list) and a vector indicating the name of the event variable for each data set (let’s call it event_name). The function will return a list consisting of each data set but including only the observations/rows that had an event. Create a list that consists of the data sets heart and retinopathy and apply the function.

Use a for loop. Note that the variable name of the events is different for the two data sets.

Solution 2

dt_events <- function(dt_list, event_name){
  new_dt <- list()
  for (k in 1:length(dt_list)) {
    dt <- dt_list[[k]]
    # keep only the rows where the k-th event variable equals 1
    new_dt[[k]] <- dt[dt[[event_name[k]]] == 1, ]
  }
  new_dt
}

dt_events(dt_list = list(heart, retinopathy), event_name = c("event", "status"))
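
The same pairing of each data set with its event variable can also be written without an explicit loop. The sketch below uses the base R function Map() on two made-up data frames; dt_events_map, toy_a and toy_b are hypothetical names, and this is only one alternative to the for loop asked for in the task.

```r
dt_events_map <- function(dt_list, event_name) {
  # Map() walks over both arguments in parallel and returns a list
  Map(function(dt, ev) dt[dt[[ev]] == 1, ], dt_list, event_name)
}

toy_a <- data.frame(id = 1:3, event = c(1, 0, 1))
toy_b <- data.frame(id = 1:4, status = c(0, 0, 1, 0))

res <- dt_events_map(list(toy_a, toy_b), c("event", "status"))
n_events <- sapply(res, nrow)   # number of rows kept per data set
print(n_events)
```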

The Apply family

Task 1

For each data set (heart and retinopathy), obtain the mean age per event status.

Solution 1

tapply(heart$age, heart$event, mean)
##         0         1 
## -3.421947 -1.270983
tapply(retinopathy$age, retinopathy$status, mean)
##        0        1 
## 20.35146 21.44516

Task 2

Create a function with the name summary_list that takes as input a list with data sets (let’s call it dt_list), a vector with the names of the continuous variables (let’s call it dt_cont) and a vector with the names of the categorical variables (let’s call it dt_cat) and returns the mean of the continuous variable per group (categorical variable) for each data set. Apply the function to the heart and retinopathy data sets. In particular, use the continuous variables year (from the heart data set) and risk (from the retinopathy data set) and the categorical variables surgery (from the heart data set) and type (from the retinopathy data set).

Use a for loop. Use the print(…) function to return the results.

Solution 2

summary_list <- function(dt_list, dt_cont, dt_cat){
  for (i in 1:length(dt_list)){
    dt <- dt_list[[i]]
    # mean of the i-th continuous variable per level of the i-th categorical variable
    print(tapply(dt[[dt_cont[i]]], dt[[dt_cat[i]]], mean))
  }
}

summary_list(dt_list = list(heart, retinopathy), dt_cont = c("year", "risk"), dt_cat = c("surgery", "type"))
##        0        1 
## 3.320975 4.105738 
## juvenile    adult 
## 9.745614 9.632530

Task 3*

Extend the previous function with an extra input argument (let’s call it calc) that indicates the measure to calculate, e.g. the mean, so that the user can decide which function to use.

Solution 3*

summary_list <- function(dt_list, dt_cont, dt_cat, calc){
  for (i in 1:length(dt_list)){
    dt <- dt_list[[i]]
    # calc is the function applied within each group
    print(tapply(dt[[dt_cont[i]]], dt[[dt_cat[i]]], calc))
  }
}

summary_list(dt_list = list(heart, retinopathy), dt_cont = c("year", "risk"), dt_cat = c("surgery", "type"), calc = median)
##        0        1 
## 3.564682 3.978097 
## juvenile    adult 
##       10        9
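
Passing a function as an argument can be taken one step further. The sketch below, on made-up data, accepts the function either as an object or as a character string (via match.fun()) and forwards additional arguments such as na.rm through "..." to tapply(). The names summary_list_dots, toy_a and toy_b are hypothetical, and returning the results invisibly as a list is an assumption added for convenience.

```r
summary_list_dots <- function(dt_list, dt_cont, dt_cat, calc, ...) {
  calc <- match.fun(calc)   # accepts mean as well as "mean"
  out <- vector("list", length(dt_list))
  for (i in seq_along(dt_list)) {
    dt <- dt_list[[i]]
    # extra arguments in ... are handed on to calc by tapply()
    out[[i]] <- tapply(dt[[dt_cont[i]]], dt[[dt_cat[i]]], calc, ...)
    print(out[[i]])
  }
  invisible(out)
}

toy_a <- data.frame(y = c(1, 2, NA, 4), g = factor(c("a", "a", "b", "b")))
toy_b <- data.frame(z = c(10, 20, 30), h = factor(c("u", "u", "v")))

res <- summary_list_dots(list(toy_a, toy_b),
                         dt_cont = c("y", "z"), dt_cat = c("g", "h"),
                         calc = "mean", na.rm = TRUE)
```

Without na.rm = TRUE the mean for group b in toy_a would be NA, because that group contains a missing value.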

